Appendix B — Assignment B

NumPy

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  3. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on 15th October 2023 at 11:59 pm.

  5. Five points are for properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

B.1 Air quality sensors

(An application of broadcasting NumPy arrays)

Air quality sensors measure the amount of contaminants in the air. This question will guide you in finding locations for installing 50 air quality sensors in the State of Colorado such that they are as far away from each other as possible. The approach below is a greedy algorithm for finding an approximate maximin design.

The file colorado_coordinate_grid.txt contains the coordinate-pairs (latitude and longitude) of potential locations for installing an air quality sensor.

B.1.1 Data

Read the file with NumPy. How many coordinate-pairs are there in the file?

Note that:

  1. A coordinate-pair means a latitude-longitude pair.

  2. ‘Air quality sensor’ will be referred to as ‘sensor’ in the questions below for brevity.

(4 points)
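As a rough illustration of reading numeric text data with NumPy, the sketch below parses a small in-memory stand-in for a coordinate file. The three rows and the whitespace delimiter are assumptions made for this example; the real file's layout may differ.

```python
import io
import numpy as np

# Hypothetical stand-in for a coordinate file: one "latitude longitude" pair per line.
# If the real file is comma-separated, pass delimiter="," to np.loadtxt.
toy_file = io.StringIO("37.0 109.0\n38.5 105.5\n40.0 102.5\n")

toy_grid = np.loadtxt(toy_file)   # 2-D float array, one row per coordinate-pair
n_pairs = toy_grid.shape[0]       # number of rows = number of coordinate-pairs
print(toy_grid.shape, n_pairs)
```

With a file on disk, the file name would be passed to np.loadtxt in place of the StringIO object.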

B.1.2 First sensor

The first sensor is to be installed closest to Denver (closest in terms of Euclidean distance). Find the coordinate-pair of the location where the first sensor will be installed. The coordinate-pair of Denver is: [39.7392\(^{\circ}\) N, 104.9903\(^{\circ}\) W]

Note that the suffixes \(^{\circ}\) N and \(^{\circ}\) W are omitted in the file colorado_coordinate_grid.txt.

Hint: Broadcasting

(4 points)
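To make the broadcasting hint concrete, here is a minimal sketch on a made-up three-point grid (the array toy_grid is hypothetical and is not taken from colorado_coordinate_grid.txt). It shows how subtracting a length-2 array from an (n, 2) array yields all n difference vectors at once.

```python
import numpy as np

# Toy candidate locations (latitude, longitude); made up for illustration.
toy_grid = np.array([[37.0, 109.0],
                     [39.5, 105.0],
                     [40.5, 102.5]])
reference = np.array([39.7392, 104.9903])  # Denver, as given above

# (3, 2) - (2,) broadcasts the reference point across every row.
diff = toy_grid - reference
distances = np.sqrt((diff ** 2).sum(axis=1))  # Euclidean distance per row

closest = toy_grid[np.argmin(distances)]    # row nearest to the reference
farthest = toy_grid[np.argmax(distances)]   # row farthest from the reference
print(closest, farthest)
```

No explicit loop is needed: the row-wise reduction with axis=1 produces one distance per candidate location.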

B.1.3 Second sensor

Find the coordinate-pair of the installation of the next sensor, such that it is as far as possible from the first sensor installed near Denver.

Hint: Broadcasting

(4 points)

B.1.4 First two sensors

Stack the coordinate-pairs of the first and second sensors vertically to obtain a 2 x 2 NumPy array. Name the array air_sensor_coordinates.

Run the code below to check if your results seem correct. The coordinate-pairs of the two air quality sensors will be marked as blue dots.

(4 points)

Code
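import matplotlib.pyplot as plt

def sensor_viz():
    # Plots the global array air_sensor_coordinates (columns: latitude, longitude)
    img = plt.imread("colorado.jpg")
    fig, ax = plt.subplots(figsize=(10.5, 15), dpi=80)
    # Stretch the map image over the given longitude/latitude extent
    ax.imshow(img, extent=[-109, -102, 37, 41])
    # Longitudes are stored as degrees West, so negate them for the x-axis
    ax.scatter(x=-air_sensor_coordinates[:, 1], y=air_sensor_coordinates[:, 0])
    ax.set_xlim(-109.05, -101.95)
    ax.set_ylim(36.95, 41.05)
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")

sensor_viz()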
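The vertical stacking described in B.1.4 can be sketched with np.vstack. The coordinate values and the name stacked below are hypothetical placeholders, not the expected answers.

```python
import numpy as np

first = np.array([39.5, 105.0])    # hypothetical coordinate-pair
second = np.array([37.0, 109.0])   # hypothetical coordinate-pair

stacked = np.vstack([first, second])    # shape (2, 2): one row per sensor
third = np.array([40.5, 102.5])         # hypothetical coordinate-pair
stacked = np.vstack([stacked, third])   # shape (3, 2): appends one more row
print(stacked.shape)
```

The same call can be reused each time a new coordinate-pair is appended to an existing array of sensors.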

B.1.5 Third sensor

Now you need to find the coordinate-pair for installing the third sensor such that it is far away from the two already-installed sensors. Proceed as follows:

  1. Find the minimum distance of each coordinate-pair in colorado_coordinate_grid.txt from the two already installed sensors. For example, if a coordinate-pair is at a distance of 5 units from the first sensor, and 10 units from the second sensor, then its minimum distance from the sensors will be \(\min(5,10) = 5\) units.

  2. Select the coordinate-pair (from colorado_coordinate_grid.txt) whose minimum distance from the two already installed sensors is the maximum.

  3. Stack the coordinate-pair of the third air quality sensor vertically on the array air_sensor_coordinates.

Call the function sensor_viz() to check if your results seem correct. The coordinate-pairs of the three air quality sensors will be marked as blue dots.

Hint:

For step (1) above:

  1. Define a function which computes the distances of a coordinate-pair from all the coordinates of air_sensor_coordinates, and returns the minimum distance.

  2. Apply the function to all the coordinate-pairs in colorado_coordinate_grid.txt using the NumPy function np.apply_along_axis().

(20 points)
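The hint above can be illustrated on toy data. In this sketch the arrays toy_grid and toy_sensors and the helper min_distance are hypothetical names chosen for the example; the points are the corners of a 3 by 4 rectangle so that the distances are easy to check by hand.

```python
import numpy as np

toy_grid = np.array([[0.0, 0.0], [0.0, 4.0], [3.0, 0.0], [3.0, 4.0]])  # made-up candidates
toy_sensors = np.array([[0.0, 0.0], [3.0, 4.0]])                        # made-up installed sensors

def min_distance(point, sensors):
    """Smallest Euclidean distance from one point to any row of `sensors`."""
    return np.sqrt(((sensors - point) ** 2).sum(axis=1)).min()

# axis=1 passes each row of toy_grid to min_distance as a 1-D array;
# toy_sensors is forwarded as the extra positional argument.
min_dists = np.apply_along_axis(min_distance, 1, toy_grid, toy_sensors)
next_location = toy_grid[np.argmax(min_dists)]
print(min_dists, next_location)
```

Two candidates tie for the largest minimum distance here; np.argmax returns the first of them.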

B.1.6 All 50 sensors

You need to find 47 more coordinate-pairs to install air quality sensors well-spread across Colorado. We will generalize the steps in the previous question to proceed as follows:

  1. Suppose you have already found the coordinate-pairs for the installation of \(i\) sensors.

  2. Find the minimum distance of each coordinate-pair in colorado_coordinate_grid.txt from the \(i\) already installed sensors. For example, if a coordinate-pair is at a distance of \(d_1\) from the first sensor, \(d_2\) from the second sensor, ..., and \(d_i\) from the \(i^{th}\) sensor, then its minimum distance from the sensors will be \(\min(d_1, d_2, \ldots, d_i)\).

  3. Select the \((i+1)^{th}\) coordinate-pair (from colorado_coordinate_grid.txt) as the one whose minimum distance from the \(i\) already installed sensors is the maximum.

Call the function sensor_viz() to check if your results seem correct. You should see 50 blue dots well spread across Colorado.

(10 points)
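The greedy selection loop can be sketched on synthetic data as follows. This is one possible arrangement and not necessarily the intended solution: it uses a broadcast pairwise-distance matrix instead of np.apply_along_axis, random points instead of the Colorado grid, an arbitrary starting point, and a target of 5 points instead of 50. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_grid = rng.uniform(0, 10, size=(200, 2))   # made-up candidate points
n_sensors = 5                                   # small toy target instead of 50

chosen = toy_grid[[0]]                          # arbitrary first point, shape (1, 2)
while chosen.shape[0] < n_sensors:
    # Pairwise differences via broadcasting: (200, 1, 2) - (1, i, 2) -> (200, i, 2)
    diff = toy_grid[:, None, :] - chosen[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))    # (200, i) distance matrix
    min_dists = dists.min(axis=1)               # nearest-sensor distance per candidate
    # Add the candidate whose nearest sensor is farthest away
    chosen = np.vstack([chosen, toy_grid[np.argmax(min_dists)]])

print(chosen.shape)
```

Points that are already chosen have a minimum distance of zero, so they are not selected again as long as other candidates remain.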

B.2 Sales

(An application of matrix multiplication with NumPy arrays)

When the monthly sales of a product are subject to seasonal fluctuations, a curve that approximates the sales might have the form:

\[y = a + b*x + c*\sin\bigg(2*\pi*\frac{x}{12}\bigg),\]

where \(x\) is the time since the starting point in months and \(y\) is the monthly sales in USD (million). The term \(a + b*x\) gives the basic sales trend and the \(\sin\) term reflects the seasonal changes in sales. Suppose the model parameters (i.e., \(a\), \(b\), and \(c\)) have been estimated and stored in the list below for the sales of a certain brand of sunscreen starting June 1, 2017.

Code
model_parameters = [2, 5, 18]

Then, the total monthly sales in June 2017 will be calculated by plugging 1 as \(x\) into the equation.
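As a quick check of the formula for a single month, the sketch below evaluates it at \(x = 1\). It assumes the list stores the parameters in the order \(a\), \(b\), \(c\), which the text suggests but does not state outright.

```python
import math

a, b, c = 2, 5, 18          # assumed order (a, b, c) of model_parameters
x = 1                       # June 2017

y = a + b * x + c * math.sin(2 * math.pi * x / 12)
print(round(y, 6))          # sin(pi/6) = 0.5, so y = 2 + 5 + 9 = 16 up to rounding
```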

Using matrix multiplication with NumPy, we wish to estimate the total sales between June 1, 2017 and March 1, 2020. (Many models failed to predict sales after that, probably due to COVID.)

Proceed as follows.

B.2.1 Create first array

Create a NumPy array where the first column is all \(1\)s, the second column is a range of numbers from 1 to the total number of months from June 1, 2017 to March 1, 2020, and the third column contains the \(\sin(2*\pi*x/12)\) values, with \(x\) taken from the second column.

(10 points)

B.2.2 Create second array

Create an array from the list model_parameters.

(3 points)

B.2.3 Multiply arrays

Use matrix multiplication to get the sales estimate for each month from June 1, 2017 to March 1, 2020.

(8 points)
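The idea behind the matrix product can be sketched on a toy horizon of three months (not the assignment's date range). Each row of the hypothetical array design holds the values \(1\), \(x\), and \(\sin(2*\pi*x/12)\) for one month, so multiplying by the parameter vector evaluates the sales formula for every month at once. The parameter order \(a\), \(b\), \(c\) is an assumption.

```python
import numpy as np

n_months = 3                                     # toy horizon, not the assignment's range
x = np.arange(1, n_months + 1)                   # 1, 2, 3

# Columns: intercept, trend, seasonal term.
design = np.column_stack([np.ones(n_months), x, np.sin(2 * np.pi * x / 12)])
params = np.array([2, 5, 18])                    # assumed order (a, b, c)

monthly = design @ params                        # shape (3,): one estimate per month
print(monthly.round(3))
```

The (3, 3) by (3,) product returns a length-3 array, which matches evaluating the formula separately for each month.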

B.2.4 Sum array elements

Find the total sales between June 1, 2017 and March 1, 2020.

(3 points)

B.3 Exercise minutes

(An application of parallel computation with NumPy)

This problem demonstrates the benefit of generating a matrix of pseudo-random numbers with NumPy.

The list exercise_minutes below contains the weekly exercise minutes of the students in the STAT303-1 Fall 2022 class.

We wish to find the 95% confidence interval of the mean of exercise_minutes using bootstrapping.

Bootstrapping is a non-parametric method for obtaining a confidence interval. The method is as follows.

  1. Suppose the list exercise_minutes has \(N\) values.

  2. Randomly sample \(N\) values with replacement from exercise_minutes.

  3. Find the mean of the \(N\) values obtained in step 2.

  4. Repeat steps 2 and 3 a total of 10,000 times.

  5. The 95% confidence interval is the range between the 2.5th and 97.5th percentile values of the 10,000 means obtained in step 3.

Code
exercise_minutes=[240, 180, 60, 300, 0, 360, 60, 140, 60, 0, 150, 60, 0, 6, 60, 300, 90, 100, 250, 240, 300, 630, 420, 50, 0, 60, 240, 300, 180, 420, 90, 8, 180, 15, 8, 150, 180, 240, 60, 1200, 210, 360, 720, 240, 360, 240, 250, 180, 600, 120, 60, 200, 360, 120, 20, 250, 60, 420, 420, 150, 350, 180, 14, 60, 450, 180, 300, 1, 180, 7, 180, 300, 70, 40, 300, 60, 180, 225, 90, 300, 240, 200, 60, 200, 360, 3, 200, 300, 90, 60, 180, 120, 10, 0, 200, 700, 300, 300, 5, 60, 420, 300, 240, 200, 180, 180, 120, 300, 375, 60, 240, 180, 180, 90, 240, 180, 15, 300, 60, 120, 120, 240, 400, 200, 60, 480, 120, 300, 180, 250, 280, 7, 600, 240, 0, 420, 60, 2, 280, 300, 60, 0, 250, 180, 540, 30, 210, 2, 90, 120, 180, 240, 540, 400, 120, 150, 360, 180, 200, 180, 30, 60, 300, 80, 60, 210, 315, 360, 275, 200, 150, 180, 200, 150, 0, 1200, 240, 120, 300, 360, 180, 240, 630, 250, 240, 5, 30, 0, 300, 60, 90]

Answer the following questions.

B.3.1 Sequential computation without NumPy

Without using NumPy, compute the:

  1. Confidence interval of mean exercise_minutes, and

  2. Time taken to execute the code

Hints:

  1. You may use the library random.

  2. You may use the library time for computing the time taken to execute the code.

(12 points)
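A loop-based bootstrap using only the standard library might look like the sketch below. The data in toy_sample are made up, the number of resamples is reduced to 2,000, and the percentiles use a simple index rule; all of these are assumptions for illustration, and other percentile conventions give slightly different endpoints.

```python
import random
import time

toy_sample = [0, 30, 60, 90, 120, 180, 240, 300]   # made-up values, not the class data
n_boot = 2000                                       # fewer resamples than the assignment asks for

random.seed(0)
start = time.perf_counter()
means = []
for _ in range(n_boot):
    resample = random.choices(toy_sample, k=len(toy_sample))  # sampling with replacement
    means.append(sum(resample) / len(resample))
means.sort()
# Simple index-based percentiles; other interpolation conventions exist.
lower = means[int(0.025 * n_boot)]
upper = means[int(0.975 * n_boot) - 1]
elapsed = time.perf_counter() - start
print(lower, upper, elapsed)
```

time.perf_counter() is used here because it is intended for measuring short durations; time.time() would also work for a rough comparison.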

B.3.2 Parallel computation with NumPy

Using NumPy, and without using loops, compute the:

  1. Confidence interval of mean exercise_minutes, and
  2. Time taken to execute the code

(12 points)
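A loop-free version can draw all the resamples at once as a single matrix. The sketch below again uses made-up data and 2,000 resamples; the names toy_sample, resamples, and boot_means are hypothetical.

```python
import time
import numpy as np

toy_sample = np.array([0, 30, 60, 90, 120, 180, 240, 300])  # made-up values
n_boot = 2000                                                # reduced for illustration

rng = np.random.default_rng(0)
start = time.perf_counter()
# One (n_boot, N) matrix of resampled values; each row is one bootstrap sample.
resamples = rng.choice(toy_sample, size=(n_boot, toy_sample.size), replace=True)
boot_means = resamples.mean(axis=1)                    # one mean per bootstrap sample
lower, upper = np.percentile(boot_means, [2.5, 97.5])
elapsed = time.perf_counter() - start
print(lower, upper, elapsed)
```

Each row of resamples plays the role of one pass through the loop in a sequential version, and the row-wise mean replaces the per-iteration average.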

B.3.3 Time saving with NumPy

Report the ratio of the time taken to execute the code without NumPy to the time taken to execute the code with NumPy.

(1 point)